fact and rule


Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

Wang, Yucheng, Hou, Yifan, Javadov, Aydin, Akhtar, Mubashara, Sachan, Mrinmaya

arXiv.org Artificial Intelligence

Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio. Yet cross-modal reasoning remains underexplored, with conflicting reports on whether added modalities help or harm performance. These inconsistencies stem from a lack of controlled evaluation frameworks and of analysis of models' internals that would isolate when and why modality interactions support or undermine reasoning. We address this gap through a logic-grounded evaluation framework that categorizes multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined. Empirically, additional modalities enhance reasoning only when they provide independent and sufficient reasoning paths, while redundant or chained entailment support often hurts performance. Moreover, reasoning degrades in three systematic ways: weaker modalities drag down overall performance, conflicts bias preference toward certain modalities, and joint signals from different modalities fail to be integrated effectively. We therefore identify two core failures: a task-composition bottleneck, where recognition and reasoning cannot be jointly executed in one pass, and a fusion bottleneck, where early integration introduces bias. Investigating further, we find that attention patterns fail to encode fact usefulness, but simple two-step prompting (recognize, then reason) restores performance, confirming the task-composition bottleneck. Moreover, modality identity remains recoverable in early layers, and softening attention during early fusion improves reasoning, highlighting biased fusion as another failure mode. Overall, our findings show that integration, not perception, is the main barrier to multimodal reasoning, suggesting composition-aware training and early-fusion control as promising directions.
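As a rough illustration of the two-step fix the abstract describes, the sketch below separates recognition from reasoning into two successive model calls. It assumes only a generic text-in/text-out llm callable; the prompt wording is illustrative, not the paper's own.

def two_step_answer(llm, inputs, question):
    # Step 1 (recognize): extract facts from each modality as plain text,
    # without asking the model to reason over them yet.
    facts = llm("List, one per line, every fact stated in the following "
                "inputs, without drawing any conclusions.\n" + inputs)
    # Step 2 (reason): reason over the extracted facts alone, so that
    # composition is no longer entangled with perception.
    return llm("Facts:\n" + facts + "\n\nQuestion: " + question +
               "\nReason step by step, then give a final answer.")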


Symbolic Working Memory Enhances Language Models for Complex Rule Application

Wang, Siyuan, Wei, Zhongyu, Choi, Yejin, Ren, Xiang

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have shown remarkable reasoning performance but struggle with multi-step deductive reasoning involving a series of rule application steps, especially when rules are presented non-sequentially. Our preliminary analysis shows that while LLMs excel at single-step rule application, their performance drops significantly in multi-step scenarios because of the challenge of rule grounding: anchoring the applicable rule and its supporting facts at each step, amidst multiple input rules, facts, and inferred facts. To address this, we propose augmenting LLMs with an external working memory and introduce a neurosymbolic framework for rule application. The memory stores facts and rules in both natural-language and symbolic forms, enabling precise tracking. Using this memory, our framework iteratively performs symbolic rule grounding and LLM-based rule implementation; the grounding step matches the predicates and variables of symbolic rules and facts to identify the applicable rules at each step. Experiments indicate our framework's effectiveness in rule application and its robustness across various steps and settings. Code and data are available at https://github.com/SiyuanWangw/RuleApplication.
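The symbolic half of this loop, grounding a rule by matching its predicates and variables against stored facts, can be sketched as follows. The tuple encoding of atoms and the "?x" variable convention are assumptions for illustration, not the framework's actual memory format.

def unify(atom, fact, binding):
    # Try to extend `binding` so that `atom` (which may contain "?vars")
    # matches the ground `fact`; return the extended binding or None.
    if len(atom) != len(fact):
        return None
    b = dict(binding)
    for a, f in zip(atom, fact):
        if a.startswith("?"):
            if b.get(a, f) != f:
                return None
            b[a] = f
        elif a != f:
            return None
    return b

def ground_rule(premises, facts):
    # Find every variable binding under which all premises hold in `facts`.
    bindings = [{}]
    for atom in premises:
        bindings = [b2 for b in bindings for fact in facts
                    if (b2 := unify(atom, fact, b)) is not None]
    return bindings

facts = {("parent", "ann", "bob"), ("parent", "bob", "cid")}
rule = {"premises": [("parent", "?x", "?y"), ("parent", "?y", "?z")],
        "conclusion": ("grandparent", "?x", "?z")}

for b in ground_rule(rule["premises"], facts):
    # In the framework, the grounded rule would next be handed to the
    # LLM-based rule-implementation step.
    print(tuple(b.get(t, t) for t in rule["conclusion"]))  # ('grandparent', 'ann', 'cid')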


Steamroller Problems: An Evaluation of LLM Reasoning Capability with Automated Theorem Prover Strategies

McGinness, Lachlan, Baumgartner, Peter

arXiv.org Artificial Intelligence

This study presents the first examination of the ability of Large Language Models (LLMs) to follow reasoning strategies that are used to guide Automated Theorem Provers (ATPs). We evaluate the performance of GPT-4, GPT-3.5 Turbo, and Google's recent Gemini model on problems from a steamroller domain. In addition to measuring accuracy, we use the natural language processing library spaCy to explore new methods of investigating LLMs' reasoning capabilities. This led to one alarming result: a low correlation between correct reasoning and correct answers for all of the tested models. We found that the models' performance when using the ATP reasoning strategies was comparable to one-shot chain of thought, and we observe that attention to uncertainty in the accuracy results is critical when drawing conclusions about model performance. Consistent with previous speculation, we confirm that LLMs have a preference for, and are best able to follow, bottom-up reasoning processes. However, the reasoning strategies can still be beneficial for deriving small and relevant sets of formulas for external processing by a trusted inference engine.
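For readers unfamiliar with the terminology, bottom-up reasoning is forward chaining: saturate the fact set by applying rules until nothing new is derivable. Below is a minimal sketch; the toy propositional encoding stands in for the study's steamroller problems and is not its actual harness.

def forward_chain(facts, rules):
    # Repeatedly apply (premises -> conclusion) rules until a fixpoint.
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in derived and all(p in derived for p in premises):
                derived.add(conclusion)
                changed = True
    return derived

facts = {"wolf(w)", "snail(s)", "grain(g)"}
rules = [({"wolf(w)"}, "animal(w)"),
         ({"snail(s)"}, "animal(s)"),
         ({"snail(s)", "grain(g)"}, "eats(s,g)")]
print(forward_chain(facts, rules))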


LAMBADA: Backward Chaining for Automated Reasoning in Natural Language

Kazemi, Mehran, Kim, Najoung, Bhatia, Deepti, Xu, Xin, Ramachandran, Deepak

arXiv.org Artificial Intelligence

Remarkable progress has been made on automated reasoning with natural text by using Language Models (LMs) and methods such as Chain-of-Thought and Selection-Inference. These techniques search for proofs in the forward direction, from axioms to the conclusion, which suffers from a combinatorial explosion of the search space and thus high failure rates for problems requiring longer chains of reasoning. The classical automated reasoning literature has shown that reasoning in the backward direction (i.e., from the intended conclusion to supporting axioms) is significantly more efficient at proof-finding. Importing this intuition into the LM setting, we develop a backward-chaining algorithm, called LAMBADA, that decomposes reasoning into four sub-modules, each implemented by few-shot prompted LM inference. We show that LAMBADA achieves sizable accuracy boosts over state-of-the-art forward reasoning methods on challenging logical reasoning datasets, particularly when deep and accurate proof chains are required.
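A minimal backward-chaining skeleton conveys the control flow: start from the goal and recurse toward known facts. In LAMBADA the commented steps (checking facts, selecting rules, decomposing goals) are few-shot prompted LM modules; here they are exact symbolic lookups so the sketch stays self-contained.

def backward_prove(goal, facts, rules, depth=5):
    # Fact check: a goal already in the fact base is proved.
    if goal in facts:
        return True
    if depth == 0:
        return False
    # Rule selection: consider rules whose conclusion matches the goal.
    for premises, conclusion in rules:
        if conclusion == goal:
            # Goal decomposition: prove every premise as a sub-goal.
            if all(backward_prove(p, facts, rules, depth - 1) for p in premises):
                return True
    return False

facts = {"metal(iron)"}
rules = [({"metal(iron)"}, "conducts(iron)"),
         ({"conducts(iron)"}, "useful(iron)")]
print(backward_prove("useful(iron)", facts, rules))  # True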


Saha

AAAI Conferences

Government regulations are critical to understanding how to do business with a government entity and how to receive other benefits. However, government regulations are also notoriously long and organized in ways that can confuse novice users. Developing cognitive assistance tools that remove some of this burden from human users would benefit a variety of audiences. The volume of data found in United States federal government regulation suggests a multiple-step approach: process the data into machine-readable text, create an automated legal knowledge base capturing various facts and rules, and eventually build a legal question-and-answer system that acquires understanding from the various regulations and provisions. The work discussed in this paper represents our initial efforts to build a framework for the Federal Acquisition Regulations System (Title 48, Code of Federal Regulations) in order to create an efficient legal knowledge base representing relationships between various legal elements, semantically similar terminologies, deontic expressions, and cross-referenced legal facts and rules.


multiPRover: Generating Multiple Proofs for Improved Interpretability in Rule Reasoning

Saha, Swarnadeep, Yadav, Prateek, Bansal, Mohit

arXiv.org Artificial Intelligence

We focus on a type of linguistic formal reasoning where the goal is to reason over explicit knowledge in the form of natural language facts and rules (Clark et al., 2020). A recent work, named PRover (Saha et al., 2020), performs such reasoning by answering a question and also generating a proof graph that explains the answer. However, compositional reasoning is not always unique, and there may be multiple ways of reaching the correct answer. Thus, in our work, we address the new and challenging problem of generating multiple proof graphs for reasoning over natural language rule-bases. Each proof provides a different rationale for the answer, thereby improving the interpretability of such reasoning systems. In order to jointly learn from all proof graphs and exploit the correlations between multiple proofs for a question, we pose this task as a set generation problem over structured output spaces, where each proof is represented as a directed graph. We propose two variants of a proof-set generation model, multiPRover. Our first model, Multilabel-multiPRover, generates a set of proofs via multi-label classification and implicit conditioning between the proofs, while the second model, Iterative-multiPRover, generates proofs iteratively by explicitly conditioning on the previously generated proofs. Experiments on multiple synthetic, zero-shot, and human-paraphrased datasets reveal that both multiPRover models significantly outperform PRover on datasets containing multiple gold proofs. Iterative-multiPRover obtains state-of-the-art proof F1 in zero-shot scenarios where each example has a single correct proof. It also generalizes better to questions requiring greater depths of reasoning, where multiple proofs are more frequent. Our code and models are publicly available at https://github.com/swarnaHub/multiPRover.
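To make "proof graph" concrete, the sketch below represents one proof as a directed graph and checks structural validity: every non-given node must be supported and the graph must be acyclic. The check is illustrative only; multiPRover's contribution is generating sets of such graphs, which this sketch does not attempt.

from collections import defaultdict

def is_valid_proof(edges, given, conclusion):
    # A node is justified if it is given, or if all of its parents are
    # justified; a cycle or an unsupported non-given node invalidates it.
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)

    memo, stack = {}, set()
    def justified(node):
        if node in given:
            return True
        if node in stack:                # cycle: not a proof
            return False
        if node in memo:
            return memo[node]
        if not parents[node]:            # non-given node with no support
            return False
        stack.add(node)
        memo[node] = all(justified(p) for p in parents[node])
        stack.discard(node)
        return memo[node]

    return justified(conclusion)

given = {"fact1", "fact2", "rule1"}
edges = [("fact1", "step1"), ("rule1", "step1"),
         ("step1", "goal"), ("fact2", "goal")]
print(is_valid_proof(edges, given, "goal"))  # True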


Learn Prolog Language by Creating an Expert System

#artificialintelligence

Prolog is a declarative programming language whose name is short for PROgramming in LOGic. In a declarative language, the programmer specifies a goal to be achieved, and the Prolog system works out how to achieve it. You can learn more on the Prolog Wikipedia page: https://en.wikipedia.org/wiki/Prolog. SWI-Prolog is an environment in which it is super easy to write and run Prolog code. After a successful installation, running it by clicking on the red owl icon opens the Prolog console, which will be used to execute and run our Prolog code.


Can machines have common sense? – Moral Robots – Medium

#artificialintelligence

The Cyc project (initially planned to run from 1984 to 1994) is the world's longest-lived AI project. The idea was to create a machine with "common sense," and it was predicted that about 10 years should suffice to see significant results. That didn't quite work out, and today, after 35 years, the project is still going on, although by now very few experts still believe in the promises made by Cyc's developers. Common sense is more than just explaining the meaning of words. For example, we have already seen how "sibling" or "daughter" can be explained in Prolog with a dictionary-like definition.


A Redefinition of Arguments in Defeasible Logic Programming

Viglizzo, Ignacio Darío (Universidad Nacional del Sur, Bahía Blanca, Argentina) | Tohmé, Fernando (Universidad Nacional del Sur, Bahía Blanca, Argentina) | Simari, Guillermo (Universidad Nacional del Sur, Bahía Blanca, Argentina)

AAAI Conferences

Defeasible Logic Programming (DELP) is a formalism that extends declarative programming to capture defeasible reasoning. Its inference mechanism, upon a query on a literal in a program, answers by indicating whether or not the literal is warranted in an argumentation process. While the properties of DELP are well known, some of its basic elements can be redefined in order to shed light on some of the subtleties of the warrant process. We discuss these alternative definitions and the cases in which they provide better performance.
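A heavily simplified sketch of the flavour of warrant DELP computes: arguments built from defeasible rules can be attacked by arguments for the complementary literal, and a comparison criterion decides defeat. The one-step arguments and numeric strengths below are illustrative assumptions; real DELP builds full dialectical trees under a specificity-based comparison.

def complement(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def arguments_for(lit, facts, drules):
    # One-step arguments: defeasible rules for `lit` whose body holds.
    return [(body, head, strength) for body, head, strength in drules
            if head == lit and all(b in facts for b in body)]

def warranted(lit, facts, drules):
    pro = arguments_for(lit, facts, drules)
    con = arguments_for(complement(lit), facts, drules)
    if not pro:
        return False
    if not con:
        return True
    best = lambda args: max(s for *_, s in args)
    return best(pro) > best(con)   # stronger counter-argument defeats

facts = {"bird(t)", "penguin(t)"}
drules = [({"bird(t)"}, "flies(t)", 1),        # birds usually fly
          ({"penguin(t)"}, "~flies(t)", 2)]    # penguins usually don't (stronger)
print(warranted("flies(t)", facts, drules))    # False: defeated by the penguin argument
print(warranted("~flies(t)", facts, drules))   # True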